The Rats Do Run This City¶

An analysis by Rayna Livingston and Laura Linne von Berg¶

Table of Contents¶

  1. Introduction
  2. Data Collection and Parsing
  3. Exploratory Analysis
  4. Hypothesis Testing
  5. Conclusion

1. Introduction¶

This past November, New York City Mayor Eric Adams signed the 'Rat Action Plan' in an attempt to cut down the city's rodent population. So far this year, the city's Sanitation Department has reported more than 21,600 rat complaints, an increase of around 71% since October 2020. Many officials blame the Covid-19 pandemic for this shocking increase, stating that work-from-home and outdoor dining policies have left more trash bags, which make a nice meal for rats, on the sidewalks and streets. The 'Rat Action Plan' outlines several new rat-minimizing rules:

  1. The city will establish rat mitigation zones
  2. The city will require reports on progress in rat mitigation zones
  3. Trash bins in high risk areas will be replaced by a new design to minimize rodent entry
  4. Exterminating rodents will become a part of acquiring a construction permit

In addition, the New York City Sanitation Department has stated that effective April 1, 2023, citizens will be fined for putting their trash out before 8 p.m. for the overnight trash collectors. By minimizing the amount of time the rats' main food source sits on the street, the city hopes to reduce the population.

Sources:

  • https://www1.nyc.gov/assets/dsny/site/resources/press-releases/department-of-sanitation-publishes-final-rules-to-drastically-reduce-hours-trash-will-sit-on-nyc-streets
  • https://therecount.com/watch/nyc-mayor-eric-adams-d/2645886608

We were very interested in exploring the data collected by the Department of Sanitation, particularly its location data. We were originally curious to see whether certain areas of the city have more of a rat problem than others, and if so, why. If some areas have far fewer rats, what is different about them, or what are policymakers doing differently there? Are features such as tourism and population related to the number of rat sightings throughout NYC? To begin to evaluate these questions, let's take a look at our data.

[Image: flatiron-plaza-manhattan.jpeg]

2. Data Collection and Parsing¶

In [1]:
#Libraries and imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import IFrame
import folium
import geopandas as gpd
from shapely.geometry import Point, Polygon
from datetime import datetime 
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

Reading the Data¶

Our first step of this process is to search for relevant data. We found a CSV file of rat sightings from 2010 to 2022 on https://data.cityofnewyork.us/. This data is updated daily and gathered from Sanitation Department service requests. We will be using it to get location data on each rat sighting.

In [2]:
rat_path = "Rat_Sightings.csv"
df = pd.read_csv(rat_path, low_memory=False)

df = df[["Created Date", "Location Type", "Incident Zip", "Incident Address", "Borough", "Latitude", "Longitude"]]
df.sort_values(by=['Created Date'], inplace=True)

Now we can add a year and month column from 'Created Date', so we can extract data pertaining to specific time periods.

In [3]:
fmt = '%m/%d/%Y %I:%M:%S %p'  # %I (12-hour clock) pairs with the AM/PM marker %p
# Extracting the year and month from 'Created Date' & making their own columns
df['Year'] = df['Created Date'].apply(lambda x: datetime.strptime(x, fmt).year)
df['Month'] = df['Created Date'].apply(lambda x: datetime.strptime(x, fmt).month)
df.head()
Out[3]:
Created Date Location Type Incident Zip Incident Address Borough Latitude Longitude Year Month
159386 01/01/2010 02:15:27 PM Vacant Building 11218.0 900 CONEY ISLAND AVENUE BROOKLYN 40.635654 -73.967757 2010 1
158069 01/01/2010 03:05:37 PM 3+ Family Apt. Building 11377.0 31-14 58 STREET QUEENS 40.756987 -73.903618 2010 1
158077 01/01/2010 04:14:27 PM 3+ Family Apt. Building 10467.0 2504 BRONX PARK EAST BRONX 40.863614 -73.870441 2010 1
159170 01/01/2010 08:29:58 AM 3+ Family Apt. Building 11206.0 202 PULASKI STREET BROOKLYN 40.692989 -73.943771 2010 1
54396 01/01/2010 08:52:19 PM Other (Explain Below) 11201.0 NaN BROOKLYN 40.688903 -73.980929 2010 1

This looks great! Our dataset now has the following nine columns:

  • Created Date
  • Location Type
  • Incident Zip
  • Incident Address
  • Borough
  • Latitude
  • Longitude
  • Year
  • Month

We can use the Year and Month data to look at how rat sightings change over time. The Location Type column can help us determine what types of places have the most frequent sightings. Zip Code, Borough, and Lat/Long will allow us to visualize and map exactly where the rats are.
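
As an aside, the row-by-row strptime/apply pattern can be replaced by a single vectorized pd.to_datetime call, which is typically much faster on a dataframe of this size. A minimal sketch, assuming the same timestamp format as the 311 export (the sample rows here are illustrative, not from the dataset):

```python
import pandas as pd

# Illustrative sample in the same format as the 311 'Created Date' column
sample = pd.DataFrame({"Created Date": ["01/01/2010 02:15:27 PM",
                                        "07/04/2015 11:30:00 AM"]})

# Parse the whole column at once; %I (12-hour clock) pairs with %p
parsed = pd.to_datetime(sample["Created Date"], format="%m/%d/%Y %I:%M:%S %p")

# The .dt accessor extracts components without a per-row Python loop
sample["Year"] = parsed.dt.year
sample["Month"] = parsed.dt.month
```

The result is identical to the apply-based version, but the parsing happens in one pass instead of once per row per column.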

To visualize the data, it will be beneficial to plot these coordinates. Any NaN values in the Latitude or Longitude columns will cause an error when we try to create a map. Therefore, we must remove rows that contain NaN values in these columns. Let's see how many rows are missing coordinate values:

In [4]:
missing_lats = df["Latitude"].isna().sum()
missing_longs = df["Longitude"].isna().sum()

print("Number of Latitude Values Missing:", missing_lats)
print("Number of Longitude Values Missing:", missing_longs)
Number of Latitude Values Missing: 2160
Number of Longitude Values Missing: 2160

We can easily drop the 2,160 rows that are missing coordinates; however, we may still want those rows when we conduct borough-level analysis. To preserve this data, we can work on a copy of the dataframe and drop the rows only there.

In [5]:
# coord_df excludes the rows that lack coordinate data;
# the original df keeps every row for later borough-level analysis
coord_df = df.dropna(subset=["Latitude", "Longitude"])

3. Exploratory Analysis¶

Maps¶

The following code generates a map of NYC's rat sightings using coordinates from 2010-2022. There are two ways we can make a map: with folium or with geopandas. Let's see which map is best for visualizing rat sighting data in NYC.

In [6]:
# geopandas allows us to plot the coordinates in our dataframe
street_map = gpd.read_file("nyu_2451_34490")

geometry = [Point(xy) for xy in zip(df['Longitude'], df['Latitude'])]
geo_df = gpd.GeoDataFrame(df, 
                          crs = ('EPSG:4326'), 
                          geometry = geometry)

fig, ax = plt.subplots(figsize = (10,10))
street_map.to_crs(epsg=4326).plot(ax=ax, color='lightgrey')
geo_df.plot(ax=ax, cmap = 'RdPu', alpha = .5)

ax.set_title('NYC Rat Incident Heatmap')
Out[6]:
Text(0.5, 1.0, 'NYC Rat Incident Heatmap')

This approach helps us visualize density throughout NYC. However, it may pack too much information onto one heatmap: with this many coordinate points, it is difficult to draw meaning from the map. We cannot see how the locations of rat sightings have shifted over time, and there are so many overlapping points that the majority of the map is the same dark purple color.

What is the best way of visualizing rat sightings for every year without becoming overwhelmed with coordinate points?

First let's make the heatmap with multiple colors to indicate density.

In [7]:
# function to generate a folium base map, with default values for zoom and tiles
def generateBaseMap(loc, zoom=10, tiles='Stamen Toner'):
    '''
    Generates a Folium base map.
    loc: [lat, lon] center of the map
    zoom: zoom level, default 10
    tiles: tile set, default Stamen Toner
    '''
    return folium.Map(location=loc, 
                      control_scale=True, 
                      zoom_start=zoom,
                      tiles=tiles)
  
nyc = [40.7400, -73.985880] # generic nyc lat/lon in list format
base_map = generateBaseMap(nyc) # pass lat/long to function

# create count column and populate with 1 for grouping and summing
coord_df.loc[:, 'Count'] = 1

# group by latitude and longitude, summing the total count for each xy coordinate
df2 = pd.DataFrame(coord_df.groupby(['Latitude', 'Longitude'])['Count'].sum().sort_values(ascending=False))

# create list of [lat, long, count] rows for input into folium HeatMap
lst = df2.reset_index().values.tolist()

# import HeatMap plugin
from folium.plugins import HeatMap

# add data to basemap which we created above with custom function
HeatMap(data=lst, radius=10).add_to(base_map);

# save base map as .html
base_map.save('./rat_sighting_HeatMap.html')

# call map 
base_map
Out[7]:
Make this Notebook Trusted to load map: File -> Trust Notebook

This graph is helpful because you can zoom in to see the rat sighting hotspots throughout the city from 2010-2022. For example, there are hotspots centered around the NYC Theater District and other popular tourist destinations.

However, without the passage of time being shown on the map, we lose a significant dimension of the data. Let's make a time-lapsed heatmap to demonstrate the progression of rat sightings over time.

In [8]:
# create blank list
df_year_list = []

# loop through each year
for year in coord_df['Year'].sort_values().unique():
    # sum sighting counts per coordinate for that year, reset index, and append as a list
    df_year_list.append(coord_df.loc[coord_df['Year'] == year, ['Latitude', 'Longitude', 'Count']].groupby(['Latitude', 'Longitude']).sum().reset_index().values.tolist()) 
   
from folium.plugins import HeatMapWithTime
# create new base map for heat map with time
base_map_2 = generateBaseMap(nyc)

# create a more meaningful index for heat map with time
start = datetime(2010, 1, 1)
end = datetime(2022, 12, 31)

# use pandas daterange function to generate date range object
daterange = pd.date_range(start=start, end=end, periods=13) 

# format each index entry as its year
time_index = [d.strftime("%Y") for d in daterange]

# instantiate HeatMapWithTime
HeatMapWithTime(df_year_list, radius=11, index=time_index,
                gradient={0.1: 'blue', 0.5: 'lime', 0.7: 'orange', 1: 'red'}, 
                min_opacity=0.4, max_opacity=0.8, use_local_extrema=True).add_to(base_map_2)

# save as html
base_map_2.save('./heatmapwithtime_ratsightings.html')

# call result
base_map_2
Out[8]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Frequency over Time¶

Another approach is to compare the frequency of rat sightings over time. We can make a bar graph that visualizes this.

In [9]:
# The frequency list keeps track of the number of sightings per year.
# We can use this later to make a bar graph
frequency = []

# Each group produced by groupby('Year') contains:
    # index 0 ==> year value (e.g. 2011)
    # index 1 ==> dataframe of that year's rat sighting listings

for year in df.groupby('Year'):
    frequency.append(len(year[1]))

years = sorted(df.Year.unique())
# This dataframe contains 13 rows & two columns representing the # of rat sightings per year (frequency)
year_freq_df = pd.DataFrame({'Year': years, 'frequency': frequency})

# Plotting the graph
ax = df.hist(column='Year', bins=25, grid=False, color='#86bf91')
ax = ax[0]
for x in ax:
    x.set_title("Frequency per Year", size=12)

    # Set x-axis label
    x.set_xlabel("Year", labelpad=20, size=12)

    # Set y-axis label
    x.set_ylabel("Frequency", labelpad=20, size=12)

Time of Year¶

Now that we have seen the distribution of sightings over years, let's explore how the sightings are distributed throughout the year. Below, we will graph the sighting counts for each month of the year to see if there are certain months or seasons in which sightings are more common.

In [10]:
ax = df['Month'].value_counts().sort_index().plot(kind='bar', title='Time of Year', color='#86bf91')
ax.set_xlabel("Month")
ax.set_ylabel("Number of Sightings")
Out[10]:
Text(0, 0.5, 'Number of Sightings')

As we can see from the bar graph above, rat sightings are lowest in the colder months and highest in the warm summer months. This could be related to the increase in tourism, and in people spending time outdoors, during the NYC summer.
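
One way to make the seasonal pattern explicit is to map months onto seasons and compare totals. A sketch, assuming meteorological seasons and illustrative per-month counts (the real values would come from df['Month'].value_counts()):

```python
import pandas as pd

# Meteorological seasons: Dec-Feb winter, Mar-May spring, and so on
season_of = {12: "Winter", 1: "Winter", 2: "Winter",
             3: "Spring", 4: "Spring", 5: "Spring",
             6: "Summer", 7: "Summer", 8: "Summer",
             9: "Fall", 10: "Fall", 11: "Fall"}

# Illustrative monthly counts, indexed by month number 1-12
monthly = pd.Series([50, 55, 70, 90, 110, 130, 140, 135, 115, 95, 70, 55],
                    index=range(1, 13))

# Group the month counts by season and sum within each season
by_season = monthly.groupby([season_of[m] for m in monthly.index]).sum()
```

With the real counts substituted in, by_season.idxmax() would confirm which season dominates.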

Analyzing Boroughs¶

NYC is one of the largest cities in the world. For administrative purposes, the city was split into five boroughs in 1898: Bronx, Brooklyn, Manhattan, Queens, and Staten Island.

Which borough has the highest number of rat sightings? We can easily determine this by graphing a bar graph of frequency per borough.

In [11]:
# The b_frequency list keeps track of the number of sightings per borough.
# We can use this later to make a bar graph
b_frequency = []
boroughs = []

for borough in df.groupby('Borough'):  
    if borough[0] != 'Unspecified':
        # Adding the frequency for each borough to an array
        borough_freq = len(borough[1])
        b_frequency.append(borough_freq)
        boroughs.append(borough[0])

borough_freq_df = pd.DataFrame({'boroughs': boroughs, 'frequency': b_frequency})   
#print(borough_freq_df)

# Plotting the graph
graph = borough_freq_df.plot.bar('boroughs', 'frequency', color='#86bf91') 
 
# Creating the title, limits, and labels 
graph.set_title('NYC Rat Sighting Frequency For Each Borough') 
graph.set_xlabel("Borough")  
graph.set_ylabel("Frequency")  

plt.show() 

#Pie chart
borough_density_df = borough_freq_df.copy()
borough_freq_df.set_index(['boroughs'], inplace=True)
colors = ['#2A2D34', '#009DDC', '#F26430', '#6761A8', '#009B72']
graph = borough_freq_df.plot.pie(y='frequency', legend = False, figsize=(7, 7),  autopct='%1.1f%%', title="\nNYC Rat Sighting Percentages For Each Borough", startangle=0, colors=colors)

As we can see, Brooklyn has the highest rat sighting frequency; however, it is also one of the larger boroughs in NYC, so this is to be expected. When comparing the number of rat sightings for each borough, we must take the borough's size into consideration. How can we measure frequency while also incorporating a borough's square mileage?

NYC Square Mileage Per Borough:

  • Bronx: 42.1
  • Brooklyn: 70.82
  • Manhattan: 22.83
  • Queens: 108.53
  • Staten Island: 58.37

We can divide the frequency by the square mileage to create a density value.

In [12]:
sq_mile_info = [42.1, 70.82, 22.83, 108.53, 58.37]

# added columns that indicate the square mileage and density for each borough
borough_density_df['sq_mile'] = sq_mile_info
borough_density_df['density'] = borough_density_df['frequency'] / borough_density_df['sq_mile']

print(borough_density_df)
        boroughs  frequency  sq_mile      density
0          BRONX      38742    42.10   920.237530
1       BROOKLYN      74547    70.82  1052.626377
2      MANHATTAN      54784    22.83  2399.649584
3         QUEENS      30580   108.53   281.765410
4  STATEN ISLAND       8586    58.37   147.096111
In [13]:
# Plotting the graph
graph = borough_density_df.plot.bar('boroughs', 'density', color='#86bf91') 
 
# Creating the title, limits, and labels 
graph.set_title('NYC Rat Sighting Density for Each Borough') 
graph.set_xlabel("Borough")  
graph.set_ylabel("Density")  

plt.show() 

#Pie chart
borough_density_df.set_index(['boroughs'], inplace=True)
colors = ['#2A2D34', '#009DDC', '#F26430', '#6761A8', '#009B72']
graph = borough_density_df.plot.pie(y='density', legend = False, figsize=(7, 7),  autopct='%1.1f%%', title="\nNYC Rat Sighting Density Percentages For Each Borough", startangle=0, colors=colors)

Manhattan is the borough with the highest number of rat sightings per square mile (density), with more than double the density of Brooklyn. This could be due to the high level of tourism that takes place in Manhattan. Manhattan is the most famous and most visited borough, so there could be a correlation between tourism and rat density.

How do we measure tourism? What other factors could be in play here? Let's think of some related factors:

  • Median Income (areas with high levels of tourism tend to be more expensive)
  • Number of Restaurants (restaurants thrive in areas with many visitors)
  • Population (a higher number of residents can lead to more sightings of rats, even if it is the same rat)

We can get a better idea of what causes a substantial amount of rat sightings in an area by looking at where the rats are being seen.

Location of Rat Sightings¶

Let's take a closer look at the specific types of places where these rats are being spotted. To do this, we have to combine and condense certain location entries to make our data usable.

In [14]:
df['Location Type'] = df['Location Type'].replace(dict.fromkeys(['3+ Family Apt. Building','3+ Family Apartment Building','Apartment','3+ Family Apt.', '3+Family Apt.', '3+ Family Apt'], 'Apartment Building'))
df['Location Type'] = df['Location Type'].replace(dict.fromkeys(['1-3 Family Mixed Use Building', '1-2 Family Mixed Use Building', '3+ Family Mixed Use Building'], 'Mixed Use Building'))
df['Location Type'] = df['Location Type'].replace(dict.fromkeys(['1-2 FamilyDwelling', '1-3 Family Dwelling', '1-2 Family Dwelling'], 'Family Dwelling'))
df['Location Type'] = df['Location Type'].replace(dict.fromkeys(['Restaurant/Bar/Deli/Bakery', 'Store', 'Commercial Building', 'Catering Service', 'Retail Store', 'Restaurant', 'Grocery Store'], 'Commercial Property'))
df['Location Type'] = df['Location Type'].replace(dict.fromkeys(['Abandoned Building', 'Vacant Building', 'Vacant Lot'], 'Vacant Lot/Property'))
df['Location Type'] = df['Location Type'].replace(dict.fromkeys(['Street Fair Vendor', 'Ground', 'Street Area'], 'Street'))
df['Location Type'] = df['Location Type'].replace(dict.fromkeys(['Summer Camp', 'Cafeteria - Public School', 'School/Pre-School'], 'School'))
df['Location Type'] = df['Location Type'].replace(dict.fromkeys(['Residential Property','Residence','Private House','Single Room Occupancy (SRO)','Mixed Use Building', 'Family Dwelling', 'Apartment Building'], 'Residential Building'))
df['Location Type'] = df['Location Type'].replace(dict.fromkeys(['Other (Explain Below)'], 'Other'))

df['Location Type'].value_counts()
Out[14]:
Residential Building          143742
Other                          31527
Commercial Property            11042
Vacant Lot/Property             9093
Construction Site               4718
Parking Lot/Garage              2133
Catch Basin/Sewer               2069
Public Garden                    902
Government Building              464
Street                           442
School                           361
Day Care/Nursery                 219
Office Building                  193
Public Stairs                    178
Hospital                         133
Building (Non-Residential)        23
Beach                              2
Name: Location Type, dtype: int64

There were several entries spelled differently, and many that we could combine into larger categories (e.g. Abandoned Building, Vacant Building, Vacant Lot into Vacant Lot/Property). Now that we have a smaller number of unique entries to work with, we can take a look at the most common locations.
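The chain of replace calls above can also be collapsed into a single old-to-new mapping applied once, which makes the consolidation easier to review and extend. A sketch using a handful of the categories from the cell above (the sample labels are illustrative):

```python
import pandas as pd

# One flat mapping from raw labels to consolidated categories
consolidate = {
    "3+ Family Apt. Building": "Apartment Building",
    "Apartment": "Apartment Building",
    "Abandoned Building": "Vacant Lot/Property",
    "Vacant Lot": "Vacant Lot/Property",
    "Other (Explain Below)": "Other",
}

raw = pd.Series(["Apartment", "Vacant Lot", "Other (Explain Below)", "School"])

# replace() leaves labels that aren't in the mapping untouched
cleaned = raw.replace(consolidate)
```

Applying one dictionary also guarantees each raw label is rewritten at most once, whereas chained replaces can cascade (e.g. 'Apartment Building' later being folded into 'Residential Building').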

In [15]:
#Location of sightings
ax = df['Location Type'].value_counts().plot(kind='bar', title='Location of Rat Sightings', color='#86bf91')
ax.set_xlabel("Locations")
ax.set_ylabel("Number of Sightings")
Out[15]:
Text(0, 0.5, 'Number of Sightings')

We can see from the cumulative graph that residential buildings are by far the most common location to see rats. The next most common locations are Other (which gives us no information), Commercial Property, and Vacant Lot/Property. Note: this only tells us where sightings take place, not where the rats live.

Residential buildings consist of private homes, apartments, and mixed-use buildings (possibly hotels): all places where tourists may reside during a stay in NYC. However, the number of long-term residents may also influence the number of rat sightings.

Knowing this information, we believe the rat sightings could be correlated with two features: resident population size & tourism. Let's investigate the cause of rat sightings through those two perspectives.

4. Hypothesis Testing¶

4.1 Looking at Median Income By ZIP Code¶

Now that we have taken a closer look at the rat sightings data, it is time to look at income in NYC. We are particularly interested in finding the median income per zip code and comparing this to the rat sightings in these areas. To find this data, we looked at https://data.cccnewyork.org/. From this site, we were able to find data for the years 2017, 2018, and 2019.

Let's test the hypothesis that an area with higher median income results in more rat sightings.

We can set our null and alternative hypotheses:

  • H0: β1 = 0
  • Ha: β1 ≠ 0

To perform a regression, we must import and clean up the data.

In [16]:
income_path = "Median_Incomes.csv"
income_df = pd.read_csv(income_path, low_memory=False)
income_df = income_df[income_df.Type == "All Households"]

income_df.drop(columns=['Zip Code XXXXX', 'Currency'], inplace=True)
income_df.sort_values(by=['Year', 'Zip_Code'], inplace=True)
income_df.Year.unique()

income_dict = {2017: {}, 2018: {}, 2019: {}}

for index, entry in income_df.iterrows(): 
    if entry.Year == 2017:
        income_dict[2017][entry.Zip_Code] = entry.Amount
    if entry.Year == 2018:
        income_dict[2018][entry.Zip_Code] = entry.Amount
    if entry.Year == 2019:
        income_dict[2019][entry.Zip_Code] = entry.Amount

#Dropping rows without zip codes
coord_df = df.dropna(subset=["Incident Zip"])
df = coord_df

#Changing zipcodes to integers
df['Incident Zip'] = df['Incident Zip'].astype(int)  

#New income column: NaN by default, filled in for 2017-2019 where the zip is known
df.loc[:,'Income'] = np.nan
for index, rat in df.iterrows():      
    if rat['Year'] >= 2017 and rat['Year'] <= 2019:
        df.at[index, 'Income'] = income_dict[rat['Year']].get(rat['Incident Zip'], np.nan)
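
The iterrows loop that builds income_dict can also be written as a dictionary comprehension over a groupby, avoiding row-by-row iteration. A sketch with illustrative rows (the column names match income_df; the values are made up):

```python
import pandas as pd

# Illustrative income table in the same shape as income_df
income_df = pd.DataFrame({
    "Year": [2017, 2017, 2018, 2019],
    "Zip_Code": [10001, 10002, 10001, 10002],
    "Amount": [60000.0, 45000.0, 62000.0, 47000.0],
})

# {year: {zip: amount}} in one pass, with no explicit row loop
income_dict = {year: grp.set_index("Zip_Code")["Amount"].to_dict()
               for year, grp in income_df.groupby("Year")}
```

This produces the same nested structure as the loop above, one inner dict per year.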

Now, we can create a dataframe which contains median incomes and their frequencies in the rat sightings data.

In [17]:
zip_df = df[df['Year'] >= 2017]
zip_df = zip_df[zip_df['Year'] <= 2019]
#Removing rows without income data
zip_df = zip_df.dropna(subset=["Income"])

#DF with income and sighting frequency
freq_df = zip_df['Income'].value_counts().rename_axis('Income').reset_index(name='counts')
freq_df
Out[17]:
Income counts
0 91846.00000 637
1 82405.52493 625
2 87486.87128 533
3 58430.50554 518
4 49195.30471 500
... ... ...
525 126631.46840 1
526 129344.00000 1
527 105456.65220 1
528 104076.69810 1
529 143353.67870 1

530 rows × 2 columns

Let's create and evaluate a regression model.

In [37]:
#Creating X and Y using Income and Frequency of Rat Sightings
x, y = freq_df['Income'].values.reshape(-1, 1), freq_df['counts'].values.reshape(-1, 1)

#Fitting a regression model
model = LinearRegression().fit(x,y)
Y_pred = model.predict(x)

plt.scatter(x,y)
plt.plot(x, Y_pred, color='red')
plt.title("Median Income vs. Frequency of Rats Per Zipcode")
plt.xlabel("Median Income")
plt.ylabel("Number of Rats")
plt.show()

#Stats for regression model
import statsmodels.api as sm

X = freq_df[['Income']].values
X = sm.add_constant(X)
y = freq_df['counts'].values

model = sm.OLS(y, X)
results = model.fit()
results.summary()
Out[37]:
OLS Regression Results
Dep. Variable: y R-squared: 0.085
Model: OLS Adj. R-squared: 0.084
Method: Least Squares F-statistic: 49.21
Date: Wed, 14 Dec 2022 Prob (F-statistic): 7.08e-12
Time: 17:33:47 Log-Likelihood: -3179.1
No. Observations: 530 AIC: 6362.
Df Residuals: 528 BIC: 6371.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 160.1522 9.605 16.675 0.000 141.284 179.020
x1 -0.0008 0.000 -7.015 0.000 -0.001 -0.001
Omnibus: 249.628 Durbin-Watson: 0.168
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1198.268
Skew: 2.112 Prob(JB): 6.30e-261
Kurtosis: 9.035 Cond. No. 1.91e+05


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.91e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
In [38]:
#Creating training and test sets
x_train, x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)

clf = LinearRegression()
clf.fit(x_train,y_train)
print("Test-set R² of trained model:", clf.score(x_test,y_test))
Test-set R² of trained model: 0.06943129150183591

At significance level 0.05, we can reject the null hypothesis. This indicates that the frequency of rat sightings has a relationship with median income. However, our R-squared value is very low (0.085), meaning this model explains only about 8.5% of the variation in our data.

As we can see, the held-out performance of our trained regression model is extremely poor.
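
A statistically significant slope and a weak model are not contradictory: with enough observations, even a faint trend produces a large t-statistic while R² stays small. A self-contained numpy sketch on synthetic data (illustrative only, not the rat data):

```python
import numpy as np

# Synthetic data: a modest true slope (0.8) buried in much larger noise
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 1, n)
y = 0.8 * x + rng.normal(0, 1, n)

# Ordinary least squares by hand
xm, ym = x.mean(), y.mean()
beta1 = np.sum((x - xm) * (y - ym)) / np.sum((x - xm) ** 2)
beta0 = ym - beta1 * xm
resid = y - (beta0 + beta1 * x)

# R-squared: share of variance explained by the fit
r2 = 1 - np.sum(resid ** 2) / np.sum((y - ym) ** 2)

# t-statistic for the slope: estimate divided by its standard error
se = np.sqrt((np.sum(resid ** 2) / (n - 2)) / np.sum((x - xm) ** 2))
t_stat = beta1 / se
```

Here the slope's t-statistic comfortably clears the usual 0.05 threshold even though the fit explains only a small fraction of the variance, the same pattern as the income regression above.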

4.2 Looking at the Number of Restaurants in NYC¶

We also chose to use the number of restaurants in an area as an indication of tourism. To gather this information, we used the dataset here: https://www.kaggle.com/datasets/new-york-city/nyc-inspections

Let's test the hypothesis that an area with more restaurants results in more rat sightings.

We can set our null and alternative hypotheses:

  • H0: β1 = 0
  • Ha: β1 ≠ 0
In [20]:
rest_path = "DOHMH_New_York_City_Restaurant_Inspection_Results.csv"
rest_df = pd.read_csv(rest_path, low_memory=False)

rest_df = rest_df[["DBA", "BORO", "ZIPCODE", "GRADE", "RECORD DATE"]]
fmt = '%m/%d/%Y'
# Extracting the year from 'RECORD DATE' & making it its own column
rest_df['Year'] = rest_df['RECORD DATE'].apply(lambda x: datetime.strptime(x, fmt).year)
rest_df = rest_df[rest_df['Year'] >= 2010]
rest_df = rest_df.dropna(subset=["ZIPCODE"])     
rest_df['ZIPCODE'] = rest_df['ZIPCODE'].astype(int)
rest_df
Out[20]:
DBA BORO ZIPCODE GRADE RECORD DATE Year
0 NaN Queens 11104 NaN 12/10/2022 2022
1 SAFUR Queens 11101 NaN 12/10/2022 2022
2 GLAZE BRYANT PARK Manhattan 10018 NaN 12/10/2022 2022
3 HOEK PIZZERIA Brooklyn 11201 NaN 12/10/2022 2022
4 PANINI LA CAFE Brooklyn 11211 Z 12/10/2022 2022
... ... ... ... ... ... ...
230706 CHARLEY ST Manhattan 10012 NaN 12/10/2022 2022
230707 SEAMORE'S Brooklyn 11201 A 12/10/2022 2022
230708 SUMA PIZZA Bronx 10451 A 12/10/2022 2022
230709 IZAKAYA FUKU Queens 11372 NaN 12/10/2022 2022
230710 WOK WOK RESTAURANT Manhattan 10013 NaN 12/10/2022 2022

227499 rows × 6 columns

In [21]:
rest_freq_df = rest_df['ZIPCODE'].value_counts().rename_axis('Zip').reset_index(name='Rest_Counts')
rest_freq_df['Rat_Count'] = rest_freq_df['Zip'].apply(lambda z: len(df[df['Incident Zip'] == z]))
rest_freq_df = rest_freq_df[rest_freq_df['Rat_Count'] != 0]
rest_freq_df
Out[21]:
Zip Rest_Counts Rat_Count
0 10003 5589 1312
1 10013 5140 1202
2 10019 4967 848
3 10036 4466 907
4 10002 4433 2242
... ... ... ...
200 12345 11 3
204 10168 9 1
207 11241 6 1
214 10048 4 1
216 10151 3 1

196 rows × 3 columns

In [39]:
x, y = rest_freq_df['Rest_Counts'].values.reshape(-1, 1), rest_freq_df['Rat_Count'].values.reshape(-1, 1)
model = LinearRegression().fit(x,y)
r_sq = model.score(x, y)

Y_pred = model.predict(x)

plt.scatter(x,y)
plt.plot(x, Y_pred, color='red')
plt.title("Number of Restaurants vs. Frequency of Rats Per Zipcode")
plt.xlabel("Number of Restaurants")
plt.ylabel("Number of Rats")
plt.show()

import statsmodels.api as sm
X = rest_freq_df[['Rest_Counts']].values
X = sm.add_constant(X)
y = rest_freq_df['Rat_Count'].values

model = sm.OLS(y, X)
results = model.fit()
results.summary()
Out[39]:
OLS Regression Results
Dep. Variable: y R-squared: 0.122
Model: OLS Adj. R-squared: 0.118
Method: Least Squares F-statistic: 27.08
Date: Wed, 14 Dec 2022 Prob (F-statistic): 4.95e-07
Time: 17:33:59 Log-Likelihood: -1641.1
No. Observations: 196 AIC: 3286.
Df Residuals: 194 BIC: 3293.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 645.0690 108.888 5.924 0.000 430.314 859.825
x1 0.3537 0.068 5.204 0.000 0.220 0.488
Omnibus: 83.268 Durbin-Watson: 1.392
Prob(Omnibus): 0.000 Jarque-Bera (JB): 244.794
Skew: 1.831 Prob(JB): 6.98e-54
Kurtosis: 7.069 Cond. No. 2.32e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.32e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [40]:
#Creating training and test sets
x_train, x_test,y_train,y_test = train_test_split(x,y,test_size =0.3)

clf = LinearRegression()
clf.fit(x_train,y_train)
print("Test-set R² of trained model:", clf.score(x_test,y_test))
Test-set R² of trained model: 0.0654050906923549

At significance level 0.05, we can reject the null hypothesis. This indicates that the frequency of rat sightings has a relationship with the number of restaurants in the area. However, our R-squared value is quite low (0.122), meaning this model explains only about 12% of the variation in our data.

Again, our trained regression model is quite poor.

4.3 Looking at Population¶

In addition to investigating tourism, we wanted to look at the general population in each zip code in relation to rat sightings. To gather this information, we used the dataset here: https://www.unitedstateszipcodes.org/zip-code-database/

Let's test the hypothesis that an area with higher population results in more rat sightings.

We can set our null and alternative hypotheses:

  • H0: β1 = 0
  • Ha: β1 ≠ 0
In [29]:
pop_path = "zip_code_populations.csv"
zipcode_pop_df = pd.read_csv(pop_path, low_memory=False)

# Sorting the dataframe by zipcode & removing non-NYC data
zipcode_pop_df.sort_values(by=['zip'], inplace=True)
zipcode_pop_df['Rat_Count'] = zipcode_pop_df['zip'].apply(lambda z: len(df[df['Incident Zip'] == z]))
zipcode_pop_df = zipcode_pop_df[zipcode_pop_df['Rat_Count'] !=0]
zipcode_pop_df.head()
Out[29]:
zip type decommissioned primary_city acceptable_cities unacceptable_cities state county timezone area_codes world_region country latitude longitude irs_estimated_population Rat_Count
3761 10001 STANDARD 0 New York NaN Empire State, Gpo, Greeley Square, Macys Finan... NY New York County America/New_York 718,917,347,646 NaN US 40.75 -74.00 21080 793
3762 10002 STANDARD 0 New York Knickerbocker Manhattan, New York City, NY, Ny City, Nyc NY New York County America/New_York 718 NaN US 40.71 -73.99 63960 2242
3763 10003 STANDARD 0 New York NaN Cooper, Manhattan, New York City, NY, Ny City,... NY New York County America/New_York 212,347,646,718,917 NaN US 40.73 -73.99 35660 1312
3764 10004 STANDARD 0 New York Bowling Green Manhattan, New York City, NY, Ny City, Nyc NY New York County America/New_York 212,347,646,718,917 NaN US 40.69 -74.02 3970 85
3765 10005 STANDARD 0 New York Wall Street Manhattan, New York City, NY, Ny City, Nyc NY New York County America/New_York 347,718,212,646,917 NaN US 40.71 -74.01 7800 69
In [31]:
x, y = zipcode_pop_df['irs_estimated_population'].values.reshape(-1, 1), zipcode_pop_df['Rat_Count'].values.reshape(-1, 1)
model = LinearRegression().fit(x,y)
r_sq = model.score(x, y)

Y_pred = model.predict(x)

plt.scatter(x,y)
plt.plot(x, Y_pred, color='red')
plt.title("Population vs. Frequency of Rats Per Zipcode")
plt.xlabel("Population")
plt.ylabel("Number of Rats")
plt.show()

import statsmodels.api as sm
X = zipcode_pop_df[['irs_estimated_population']].values
X = sm.add_constant(X)
y = zipcode_pop_df['Rat_Count'].values

model = sm.OLS(y, X)
results = model.fit()
results.summary()
Out[31]:
OLS Regression Results
Dep. Variable: y R-squared: 0.335
Model: OLS Adj. R-squared: 0.331
Method: Least Squares F-statistic: 100.0
Date: Wed, 14 Dec 2022 Prob (F-statistic): 2.43e-19
Time: 18:40:19 Log-Likelihood: -1654.8
No. Observations: 201 AIC: 3314.
Df Residuals: 199 BIC: 3320.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 95.6337 113.439 0.843 0.400 -128.062 319.330
x1 0.0261 0.003 10.002 0.000 0.021 0.031
Omnibus: 94.159 Durbin-Watson: 1.282
Prob(Omnibus): 0.000 Jarque-Bera (JB): 392.203
Skew: 1.863 Prob(JB): 6.83e-86
Kurtosis: 8.741 Cond. No. 7.63e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.63e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
In [32]:
#Creating training and test sets
x_train, x_test,y_train,y_test = train_test_split(x,y,test_size =0.3)

clf = LinearRegression()
clf.fit(x_train,y_train)
print("Test-set R² of trained model:", clf.score(x_test,y_test))
Test-set R² of trained model: 0.2620997636028911

At significance level 0.05, we can reject the null hypothesis. This indicates that the frequency of rat sightings has a relationship with the population in the area. In addition, our R-squared value is noticeably higher (0.335), meaning this model explains about a third (33.5%) of the variability in our data.

Additionally, our trained regression model performed better here... but still poorly!
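
One caveat on terminology: for regressors, scikit-learn's score method reports the coefficient of determination R², not a classification accuracy, so the numbers printed above are held-out R² values. Computed by hand on toy values (illustrative numbers only):

```python
import numpy as np

# Toy held-out targets and predictions from some fitted regressor
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.5])

# R² = 1 - SS_res / SS_tot, which is what a regressor's .score() reports
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
```

Unlike accuracy, R² can even be negative on held-out data when the model predicts worse than the mean of the targets.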

5. Conclusion¶

Challenges¶

There were several challenges we faced during this process:

  • We lacked significant numeric data (our data was mainly categorical)
  • Many of our data points were dependent on each other (Zip Code, Coordinates, and Street Address)
  • Our dataset contained self-reported rat sightings, so it was likely an undercount
  • The number of rat sightings does not directly translate to the number of rats in NYC
  • Some entries were lacking coordinate data, so we were unable to plot them

Insight¶

  • Rat sightings are most often reported during the summer months
  • About 70% of rats are reported in residential buildings (apartments, mixed-use buildings, hotels, etc.)
  • Brooklyn has the highest number of rat sightings from 2010-2022, however Manhattan has an extremely high rat sighting density (rat sighting frequency per square mile)
  • Annual rat sightings increased by 10,000 from 2020 to 2021

Moving Forward¶

Throughout our analysis, we investigated the relationship between rat sighting frequency and three factors: median income, number of restaurants, and population per ZIP code in NYC. We found little to no relationship for the first two; population vs. rat sighting frequency was the most correlated.

It may be interesting in the future to look at how the new policies impact rat sightings around New York City. With the new initiative, we hope more data becomes available to analyze. It may also be beneficial to explore other factors related to tourism. We could not find a strong correlation between rat sightings and features such as median income and number of restaurants, but we would like to discover others.
